Action Temporal Localization in Untrimmed Videos via Multi-stage CNNs
نویسندگان
چکیده
We address action temporal localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in action temporal localization via multi-stage segment-based 3D ConvNets: (1) a proposal stage identifies candidate segments in a long video that may contain actions; (2) a classification stage learns one-vs-all action classification model to serve as initialization for the localization stage; and (3) a localization stage fine-tunes on the model learnt in the classification stage to localize each action instance. We propose a novel loss function for the localization stage to explicitly consider temporal overlap and therefore achieve high temporal localization accuracy. On two large-scale benchmarks, our approach achieves significantly superior performances compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and increased from 15.0% to 19.0% on THUMOS 2014, when the overlap threshold for evaluation is set to 0.5.
منابع مشابه
Real-Time Temporal Action Localization in Untrimmed Videos by Sub-Action Discovery
This paper presents a computationally efficient approach for temporal action detection in untrimmed videos that outperforms state-of-the-art methods by a large margin. We exploit the temporal structure of actions by modeling an action as a sequence of sub-actions. A novel and fully automatic sub-action discovery algorithm is proposed, where the number of sub-actions for each action as well as t...
متن کاملWhat if we do not have multiple videos of the same action? — Video Action Localization Using Web Images
This paper tackles the problem of spatio-temporal action localization in a video, without assuming the availability of multiple videos or any prior annotations. Action is localized by employing images downloaded from internet using action name. Given web images, we first dampen image noise using random walk and evade distracting backgrounds within images using image action proposals. Then, give...
متن کاملUntrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge
Current state-of-the-art human activity recognition is focused on the classification of temporally trimmed videos in which only one action occurs per frame. We propose a simple, yet effective, method for the temporal detection of activities in temporally untrimmed videos with the help of untrimmed classification. Firstly, our model predicts the top k labels for each untrimmed video by analysing...
متن کاملHand Gesture Recognition from RGB-D Data using 2D and 3D Convolutional Neural Networks: a comparative study
Despite considerable enhances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total fr...
متن کاملTemporal Segment Networks for Action Recognition in Videos
Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1601.02129 شماره
صفحات -
تاریخ انتشار 2016